1 Overview

Our research question is: how to audit whether labels are “conceptualized consistently” across different countries and different languages. We propose a framework to audit labels themselves for world-wide inclusivity. Here, we are not concerned with a specific ML implementation, and instead audit the labels that are being fed into systems.

Figure 1 is descriptive statistics of our survey.

  1. We consider country comparisons in three ways:
  • Within each country, how much “internal” agreement is there on labels of interest? (Figure 2a)
  • How do different populations view these labels in specific contexts? (Figure2b)
  • If label outcome differences are found across countries, is there a data-backed explanation relating to culture? (Figure3)
  1. We consider language comparison in three ways:
  • How similar are label choices for bilingual speakers when asked questions in their native language versus English? (Figure4)
  • How similar are label choices for English-speakers across English-speaking countries? (Figure5)
  • How similar are label choices for speakers of non-English language? (Figure6)

Figure 5 and 6 should look similar to Figure 2b using similar approach and have not been plotted currently in this doc.

bookdown::serve_book()

2 Figure 1: Discriptive Statistics

source("read.R")

head(df)
## # A tibble: 6 × 17
##   Response.ID    game  country GENDER   AGE ageCut native Langu…¹ region exper…²
##   <chr>          <chr> <fct>   <chr>  <int> <fct>  <fct>  <chr>   <fct>    <dbl>
## 1 US_English_No… PUBG  US      Male      45 45-55  Engli… English US           5
## 2 US_English_No… PUBG  US      Male      45 45-55  Engli… English US           5
## 3 US_English_No… PUBG  US      Male      45 45-55  Engli… English US           5
## 4 US_English_No… PUBG  US      Male      45 45-55  Engli… English US           5
## 5 US_English_No… PUBG  US      Male      45 45-55  Engli… English US           5
## 6 US_English_No… PUBG  US      Male      45 45-55  Engli… English US           5
## # … with 7 more variables: hardcore <fct>, accessibility <fct>, labeler <chr>,
## #   ambassador <chr>, label <fct>, answer <int>, gender <chr>, and abbreviated
## #   variable names ¹​Language, ²​experience
## # ℹ Use `colnames()` to see all variable names

2.1 Participant breakdown by gender, ageCut, hardcore, accessibility, country, game

# 5,991 participants in total
nrow(df %>% distinct(Response.ID))
## [1] 5991
# gender, age, experience, accessibility (disability) breakdowns
# gender: Male, Female, and Nonbinary. 
df %>% distinct(gender, Response.ID) %>% count(gender)
## # A tibble: 3 × 2
##   gender        n
##   <chr>     <int>
## 1 Female      432
## 2 Male       5497
## 3 NonBinary    62
# ageCut: 18-24, 25-34, 35-44, 45-55, 55+. Per Xbox internal age breakdowns.
df %>% distinct(ageCut, Response.ID) %>% count(ageCut)
## # A tibble: 6 × 2
##   ageCut     n
##   <fct>  <int>
## 1 18-24   1182
## 2 25-34   1620
## 3 35-44   1641
## 4 45-55    773
## 5 55+      111
## 6 <NA>     664
# hardcore: hardcore (play games over 1 time a week), non-hardcore (play games 1 time or less)
df %>% distinct(hardcore, Response.ID) %>% count(hardcore)
## # A tibble: 2 × 2
##   hardcore         n
##   <fct>        <int>
## 1 hardcore      5608
## 2 non-hardcore   383
# accessibility: or really disability, Yes/No. Those people who incated that they have disability or not
df %>% distinct(accessibility, Response.ID) %>% count(accessibility)
## # A tibble: 4 × 2
##   accessibility            n
##   <fct>                <int>
## 1 No                    5466
## 2 Prefer not to answer   112
## 3 Yes                    328
## 4 <NA>                    85
# countries 
df %>% distinct(country, Response.ID) %>% count(country)
## # A tibble: 16 × 2
##    country          n
##    <fct>        <int>
##  1 US            3040
##  2 Germany        245
##  3 Poland         123
##  4 Greece         319
##  5 Japan          237
##  6 Korea          165
##  7 Singapore       69
##  8 India           99
##  9 Saudi.Arabia    35
## 10 South.Africa   212
## 11 Nigeria        170
## 12 Brazil         280
## 13 Argentina      182
## 14 Colombia       165
## 15 Chile          193
## 16 Mexico         457
# games
df %>% distinct(game, Response.ID) %>% count(game)
## # A tibble: 11 × 2
##    game                              n
##    <chr>                         <int>
##  1 Animal Crossing: New Horizons  1102
##  2 Call of Duty: Vanguard         1718
##  3 Candy Crush                     799
##  4 Elden Ring                     1896
##  5 FIFA22                         1123
##  6 Grand Theft Auto V             3168
##  7 Mario Kart 8                   1651
##  8 Minecraft                      2631
##  9 PUBG                           1241
## 10 Stardew Valley                 1043
## 11 The Sims 3                      429

2.2 Label correlations

  • In this plot, we look at correlation between labels from our sample. We can observe that difficulty, violent, action, action.motivation, control_complexity, strategy, learning_curve tend to go together. Comedic, creativity, pacifist, zen,made.for.kids, cozy are clustered together.
library(corrplot)
## corrplot 0.92 loaded
# disregard the NA
res <- df %>% 
  filter(!(label %in% c("NA.positive.opinion", "NA.negative.opinion", "NA.feeling", "NA.art"))) %>%
  mutate(label=factor(label, levels=unique(df$label))) %>%  # keep order in the final graph Figure 1
  select(Response.ID, label, answer) %>% 
  pivot_wider(., names_from = label, values_from = answer) %>%
  unnest() %>% 
  select(-Response.ID)

mx <- cor(res, use = "complete.obs")
corrplot(mx, type = "upper", order = "FPC", tl.cex=2,
         tl.col = "black", tl.srt = 45, title="Figure 1: Label Correlation")

3 Figure 2a: Consider country comparisons in three ways

This step is optional. We remove games where a country has very few response (< 5). In other words, we find games with >=5 participants across all countries.

# filter out countries with few data point
# for each game, remove countries with few data points
constraints <- df %>% count(game, country, label) %>%  # group by country and label 
  select(game, country, n) %>% distinct() %>%
  filter(n < 5)

# remove games that have too few response in any country
df <- df %>% filter(!(game %in% unique(constraints$game))) %>%
  filter(!(label %in% c("NA.positive.opinion", "NA.negative.opinion", "NA.feeling", "NA.art")))

# game included
unique(df$game)
## [1] "PUBG"                   "Grand Theft Auto V"     "Minecraft"             
## [4] "Elden Ring"             "FIFA22"                 "Call of Duty: Vanguard"

3.1 Within each country, how much “internal” agreement is there on labels of interest?

We seek to answer questions such as does the concept of a “cozy game” elicit more agreement within the US than it does in Japan? To achieve so, we calculate the average standard deviation/variance of label outcomes and comparing across countries. We plot standard deviation to reduce the length of this document.

Code Description: plotFigure2a is a wrapper function that plots all 28 label output. The first three are 1-3 scale, 1-4 scale for difficulty the latter 24 are 0-1 scale. plotVariance is the plotting function.

# wrapper function that enable individual game
plotFigure2a <- function(game = "All", sd=FALSE, debug=FALSE) {
  var.df <- df %>% group_by(game, country, label) %>%
    summarise(mean_answer = mean(answer, na.rm=TRUE),
          variance = var(answer, na.rm=TRUE),
          sd=sd(answer, na.rm=TRUE))
  
  if(game == "All" && !sd) {
    # average the variances across games for each country
    country.df <- var.df %>% group_by(country, label) %>% 
      summarise(y=mean(variance, na.rm=TRUE))
  } else if (game == "All" && sd) {
    country.df <- var.df %>% group_by(country, label) %>% 
      summarise(y=mean(sd, na.rm=TRUE))
  } else if(game != "All" && !sd) {
    country.df <- var.df %>% filter(game == {{ game }}) %>% mutate(y=variance)
  } else if (game != "All" && sd) {
    country.df <- var.df %>% filter(game == {{ game }}) %>% mutate(y=sd)
  } else {
    print("Error...")
  }
  
  p1 <- plotVariance(country.df, gameTitle=game, sd=sd, binary=FALSE)
  p2 <- plotVariance(country.df, gameTitle=game, sd=sd, binary=TRUE)
  res <- ggarrange(p1, p2, ncol=1, heights = c(1, 6))
  return(res)
}

# plot variance of a label across countries
plotVariance <- function(inputDf, gameTitle="All", sd=FALSE, binary=FALSE) {
  # color the US
  inputDf$color <- inputDf$country == "US"
  
  plotDf <- inputDf[with(inputDf, order(label, country, y)),] %>% 
      filter(!label %in% c("control_complexity", "learning_curve", "difficulty", "replayability")) 
      
  if(!binary) {
    plotDf <- inputDf[with(inputDf, order(label, country, y)),] %>% 
      filter(label %in% c("control_complexity", "learning_curve", "difficulty", "replayability"))
  } 
  if(sd) {
    yLabel <- "Standard Devaition"
  } else {
    yLabel <- "Variance"
  }
  p <- plotDf %>% ggplot(aes(x=reorder_within(country, y, label), y=y, color=color)) +
        geom_point() + 
        scale_x_reordered() +
        facet_wrap(~label, ncol=4, scales = "free") +
        ylim(c(0, 3)) +
        scale_color_manual(values=c("#999999", "#56B4E9")) +
        labs(y = yLabel, x = NULL,
             title = paste0("How much internal agreement is there on labels of interest (", gameTitle, ")")) 
  if(binary) {
    p <- p + ylim(c(0, 1))
  }
  return(p)
}

3.1.1 Taken all games together.

Here we look at the standard deviation across all the games. We look at all the games together. (e.g., Does the concept of a “cozy game” elicit more agreement within the US than it does in Japan regardless of games?) The US is highlighted in the graph in Figure 2a.

We can observe that Saudi Arabia with the highest variability (i.e., standard deviation) 11 times, South Africa 2 times, Nigeria 7 times, Singapore 4 times, Japan 1 time, India 1 time, Mexico 1 time, and Colombia 1 time.

plotFigure2a("All", sd=TRUE, debug=FALSE)

3.1.2 Breakdown by game

Here we look at the variance/standard deviation in individual games. We look at all the games together. (e.g., Does the concept of a “cozy game” elicit more agreement within the US than it does in Japan in specific games?)

Note that not in one time did the US has the highest standard deviation.

3.1.2.1 PUBG

plotFigure2a("PUBG", sd=TRUE, debug=FALSE)
## `summarise()` has grouped output by 'game', 'country'. You can override using
## the `.groups` argument.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

3.1.2.2 Grand Theft Auto V

plotFigure2a("Grand Theft Auto V", sd=TRUE, debug=FALSE)
## `summarise()` has grouped output by 'game', 'country'. You can override using
## the `.groups` argument.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

3.1.2.3 Minecraft

plotFigure2a("Minecraft", sd=TRUE, debug=FALSE)
## `summarise()` has grouped output by 'game', 'country'. You can override using
## the `.groups` argument.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

3.1.2.4 Elden Ring

plotFigure2a("Elden Ring", sd=TRUE, debug=FALSE)
## `summarise()` has grouped output by 'game', 'country'. You can override using
## the `.groups` argument.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

3.1.2.5 FIFA22

plotFigure2a("FIFA22", sd=TRUE, debug=FALSE)
## `summarise()` has grouped output by 'game', 'country'. You can override using
## the `.groups` argument.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

3.1.2.6 Call of Duty: Vanguard

plotFigure2a("Call of Duty: Vanguard", sd=TRUE, debug=FALSE)
## `summarise()` has grouped output by 'game', 'country'. You can override using
## the `.groups` argument.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.

3.2 Within each region, how much “internal” agreement is there on labels of interest?

In this case, we group countries to higher-level regions based on the World Value Survey map. We perform the same analysis but on a regional level.

# wrapper function that enable individual game
plotFigure2aRegion <- function(game = "All", sd=FALSE, debug=FALSE) {
  var.df <- df %>% group_by(game, region, label) %>%
    summarise(mean_answer = mean(answer, na.rm=TRUE),
          variance = var(answer, na.rm=TRUE),
          sd=sd(answer, na.rm=TRUE))
  
  if(game == "All" && !sd) {
    # average the variances across games for each region
    region.df <- var.df %>% group_by(region, label) %>% 
      summarise(y=mean(variance, na.rm=TRUE))
  } else if (game == "All" && sd) {
    region.df <- var.df %>% group_by(region, label) %>% 
      summarise(y=mean(sd, na.rm=TRUE))
  } else if(game != "All" && !sd) {
    region.df <- var.df %>% filter(game == {{ game }}) %>% mutate(y=variance)
  } else if (game != "All" && sd) {
    region.df <- var.df %>% filter(game == {{ game }}) %>% mutate(y=sd)
  } else {
    print("Error...")
  }
  
  p1 <- plotVariance(region.df, gameTitle=game, sd=sd, binary=FALSE)
  p2 <- plotVariance(region.df, gameTitle=game, sd=sd, binary=TRUE)
  res <- ggarrange(p1, p2, ncol=1, heights = c(1, 6))
  return(res)
}

# plot variance of a label across countries
plotVariance <- function(inputDf, gameTitle="All", sd=FALSE, binary=FALSE) {
  # color the US
  inputDf$color <- inputDf$region == "US"
  
  plotDf <- inputDf[with(inputDf, order(label, region, y)),] %>% 
      filter(!label %in% c("control_complexity", "learning_curve", "difficulty", "replayability")) 
      
  if(!binary) {
    plotDf <- inputDf[with(inputDf, order(label, region, y)),] %>% 
      filter(label %in% c("control_complexity", "learning_curve", "difficulty", "replayability"))
  } 
  if(sd) {
    yLabel <- "Standard Devaition"
  } else {
    yLabel <- "Variance"
  }
  p <- plotDf %>% ggplot(aes(x=reorder_within(region, y, label), y=y, color=color)) +
        geom_point() + 
        scale_x_reordered() +
        facet_wrap(~label, ncol=4, scales = "free") +
        ylim(c(0, 3)) +
        scale_color_manual(values=c("#999999", "#56B4E9")) +
        labs(y = yLabel, x = NULL,
             title = paste0("How much internal agreement is there on labels of interest (", gameTitle, ")")) 
  if(binary) {
    p <- p + ylim(c(0, 1))
  }
  return(p)
}

3.2.1 Taken all games together.

plotFigure2aRegion("All", sd=TRUE, debug=FALSE)

4 Figure 2b: How do different populations view these labels in specific contexts?

To answer this question, we calculate differences in estimated mean label outcomes for each demographic using causal matching analysis. To predict the probability of having a game label in a certain country, we use multilevel regression and post-stratification (MRP).

# Import libraries
source("read.R")
library(ggalt)
library(reshape2)

options(dplyr.summarise.inform = FALSE)

4.1 Calculate differences in estimated mean label outcomes for each demographic using causal matching analysis

We first check the differences between means for each labels across countries below. We run the following analyzeCountry function to each label and explores whether there is a significant difference between US and non-US participants in our matched dataframe.

analyzeCountry <- function(inputDf, label, lower=1, upper=3) {
  df <- inputDf %>% filter(label == {{ label }}) %>% 
    mutate(is_us = ifelse(country == "US", 1, 0)) %>% 
    mutate(gender.no.nonbinary = ifelse(gender == "Male", 1, 0)) %>% # depending on how we want to deal with this
    mutate(gender.no.nonbinary = factor(gender.no.nonbinary)) %>%
    na.omit() 
  
  # causal matching using matchit
  m.out <- matchit(is_us ~ game + ageCut + hardcore + gender.no.nonbinary, method = "nearest", distance = "mahalanobis", link = "probit", replace = TRUE, data=df)
  
  # optional for debugging  
  s.out <- summary(m.out, standardize = TRUE)
  plot(s.out)
  
  matched.df <- match.data(m.out) 
  matched.df$is_us <- factor(matched.df$is_us)
  model <- lm(answer ~ is_us, data=matched.df)
  print(summary(model))

  annotator <- df %>% filter(labeler == "Yes" & label == {{ label }}) %>%
    count(game, answer) %>%
    group_by(game) %>%
    summarize(majority.vote = mean(answer[which(n==max(n))])) %>% ungroup()
  
  p <- plotByGame(matched.df, "is_us", label, LOWER=lower, UPPER = upper) +
    geom_point(aes(x=game, y=majority.vote, color="Polish Annotator",  shape="Polish Annotator"), size=2, data=annotator) +  # majority.vote is the Polish annotator majority vote
    labs(title=paste0(label, " Score by Country"), y="Mean Score", x="Games") +
    scale_color_discrete("Country")
  return(p)
}
  • In the remainder of this section, we look at each individual label. For each label, we can look at whether there is a significant difference between US and non-US participants. For example, learning curve, replaybility, zen, space, violent, action, comedic, grinding, anime, motivation:action, motivation:social, motivation:immersion are significantly different.

4.1.1 Control Complexity

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6644 -0.6257  0.3356  0.3743  1.3743 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.66443    0.05624  29.597   <2e-16 ***
## is_us1      -0.03875    0.05673  -0.683    0.495    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6865 on 8669 degrees of freedom
## Multiple R-squared:  5.384e-05,  Adjusted R-squared:  -6.151e-05 
## F-statistic: 0.4667 on 1 and 8669 DF,  p-value: 0.4945

4.1.2 Learning Curve

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7651 -0.6227 -0.6227  0.3773  1.3773 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.76510    0.06333  27.871   <2e-16 ***
## is_us1      -0.14236    0.06388  -2.228   0.0259 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7731 on 8669 degrees of freedom
## Multiple R-squared:  0.0005725,  Adjusted R-squared:  0.0004572 
## F-statistic: 4.966 on 1 and 8669 DF,  p-value: 0.02588

4.1.3 Difficulty

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90894 -0.90894  0.09106  0.10067  2.10067 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.899329   0.076427  24.852   <2e-16 ***
## is_us1      0.009613   0.077092   0.125    0.901    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9329 on 8669 degrees of freedom
## Multiple R-squared:  1.793e-06,  Adjusted R-squared:  -0.0001136 
## F-statistic: 0.01555 on 1 and 8669 DF,  p-value: 0.9008

4.1.4 Replayability

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4363 -0.4363  0.5637  0.5637  0.7047 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.29530    0.05902  38.892   <2e-16 ***
## is_us1       0.14098    0.05953   2.368   0.0179 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7204 on 8669 degrees of freedom
## Multiple R-squared:  0.0006465,  Adjusted R-squared:  0.0005313 
## F-statistic: 5.608 on 1 and 8669 DF,  p-value: 0.0179

4.1.5 Pacifist

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2013 -0.1641 -0.1641 -0.1641  0.8359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.20134    0.03039   6.626 3.65e-11 ***
## is_us1      -0.03730    0.03065  -1.217    0.224    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3709 on 8669 degrees of freedom
## Multiple R-squared:  0.0001708,  Adjusted R-squared:  5.544e-05 
## F-statistic: 1.481 on 1 and 8669 DF,  p-value: 0.2237

4.1.6 Made for kids

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3557 -0.3202 -0.3202  0.6798  0.6798 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.35570    0.03824   9.301   <2e-16 ***
## is_us1      -0.03547    0.03858  -0.920    0.358    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4668 on 8669 degrees of freedom
## Multiple R-squared:  9.754e-05,  Adjusted R-squared:  -1.781e-05 
## F-statistic: 0.8456 on 1 and 8669 DF,  p-value: 0.3578

4.1.7 Cozy

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3044 -0.3044 -0.3044  0.6956  0.7248 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.27517    0.03768   7.302 3.08e-13 ***
## is_us1       0.02922    0.03801   0.769    0.442    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.46 on 8669 degrees of freedom
## Multiple R-squared:  6.817e-05,  Adjusted R-squared:  -4.718e-05 
## F-statistic: 0.591 on 1 and 8669 DF,  p-value: 0.4421

4.1.8 Zen

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2773 -0.2773 -0.2773  0.7227  0.7920 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.20805    0.03662   5.681 1.38e-08 ***
## is_us1       0.06923    0.03694   1.874    0.061 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.447 on 8669 degrees of freedom
## Multiple R-squared:  0.000405,   Adjusted R-squared:  0.0002897 
## F-statistic: 3.512 on 1 and 8669 DF,  p-value: 0.06096

4.1.9 Fantasy

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3340 -0.3340 -0.3340  0.6660  0.6711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.32886    0.03864   8.511   <2e-16 ***
## is_us1       0.00510    0.03898   0.131    0.896    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4716 on 8669 degrees of freedom
## Multiple R-squared:  1.975e-06,  Adjusted R-squared:  -0.0001134 
## F-statistic: 0.01712 on 1 and 8669 DF,  p-value: 0.8959

4.1.10 Space

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.08054 -0.01877 -0.01877 -0.01877  0.98123 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.08054    0.01141   7.061 1.78e-12 ***
## is_us1      -0.06176    0.01150  -5.368 8.15e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1392 on 8669 degrees of freedom
## Multiple R-squared:  0.003313,   Adjusted R-squared:  0.003198 
## F-statistic: 28.82 on 1 and 8669 DF,  p-value: 8.151e-08

4.1.11 Heroic

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2537 -0.2537 -0.2537  0.7463  0.8054 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.19463    0.03560   5.467 4.69e-08 ***
## is_us1       0.05907    0.03591   1.645      0.1    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4345 on 8669 degrees of freedom
## Multiple R-squared:  0.000312,   Adjusted R-squared:  0.0001967 
## F-statistic: 2.706 on 1 and 8669 DF,  p-value: 0.1

4.1.12 Real World

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3624 -0.3545 -0.3545  0.6455  0.6455 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.362416   0.039197   9.246   <2e-16 ***
## is_us1      -0.007922   0.039538  -0.200    0.841    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4785 on 8669 degrees of freedom
## Multiple R-squared:  4.631e-06,  Adjusted R-squared:  -0.0001107 
## F-statistic: 0.04014 on 1 and 8669 DF,  p-value: 0.8412

4.1.13 Violent

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4066 -0.4066 -0.4066  0.5934  0.8054 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.19463    0.04012   4.851 1.25e-06 ***
## is_us1       0.21196    0.04047   5.237 1.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4898 on 8669 degrees of freedom
## Multiple R-squared:  0.003154,   Adjusted R-squared:  0.003039 
## F-statistic: 27.43 on 1 and 8669 DF,  p-value: 1.669e-07

4.1.14 Action

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4703 -0.4703 -0.4703  0.5297  0.6778 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.32215    0.04085   7.886  3.5e-15 ***
## is_us1       0.14816    0.04121   3.596 0.000325 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4986 on 8669 degrees of freedom
## Multiple R-squared:  0.001489,   Adjusted R-squared:  0.001374 
## F-statistic: 12.93 on 1 and 8669 DF,  p-value: 0.0003253

4.1.15 Emotional

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2081 -0.1907 -0.1907 -0.1907  0.8093 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.20805    0.03221   6.460  1.1e-10 ***
## is_us1      -0.01737    0.03249  -0.535    0.593    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3931 on 8669 degrees of freedom
## Multiple R-squared:  3.298e-05,  Adjusted R-squared:  -8.237e-05 
## F-statistic: 0.2859 on 1 and 8669 DF,  p-value: 0.5929

4.1.16 Comedic

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4488 -0.4488 -0.4488  0.5512  0.6309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.36913    0.04073   9.063   <2e-16 ***
## is_us1       0.07971    0.04109   1.940   0.0524 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4972 on 8669 degrees of freedom
## Multiple R-squared:  0.000434,   Adjusted R-squared:  0.0003187 
## F-statistic: 3.764 on 1 and 8669 DF,  p-value: 0.0524

4.1.17 Experimental

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1745 -0.1632 -0.1632 -0.1632  0.8368 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.17450    0.03029   5.760 8.69e-09 ***
## is_us1      -0.01127    0.03056  -0.369    0.712    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3698 on 8669 degrees of freedom
## Multiple R-squared:  1.57e-05,   Adjusted R-squared:  -9.966e-05 
## F-statistic: 0.1361 on 1 and 8669 DF,  p-value: 0.7122

4.1.18 Strategy

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4488 -0.4488 -0.4488  0.5512  0.5571 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.442953   0.040750  10.870   <2e-16 ***
## is_us1      0.005885   0.041105   0.143    0.886    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4974 on 8669 degrees of freedom
## Multiple R-squared:  2.365e-06,  Adjusted R-squared:  -0.000113 
## F-statistic: 0.0205 on 1 and 8669 DF,  p-value: 0.8862

4.1.19 Grinding

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5116 -0.5116  0.4884  0.4884  0.6510 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.34899    0.04092   8.528  < 2e-16 ***
## is_us1       0.16262    0.04128   3.940 8.23e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4995 on 8669 degrees of freedom
## Multiple R-squared:  0.001787,   Adjusted R-squared:  0.001672 
## F-statistic: 15.52 on 1 and 8669 DF,  p-value: 8.227e-05

4.1.20 Anime

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17450 -0.07897 -0.07897 -0.07897  0.92103 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.17450    0.02228   7.831 5.40e-15 ***
## is_us1      -0.09552    0.02248  -4.250 2.16e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.272 on 8669 degrees of freedom
## Multiple R-squared:  0.002079,   Adjusted R-squared:  0.001964 
## F-statistic: 18.06 on 1 and 8669 DF,  p-value: 2.16e-05

4.1.21 Hand drawn

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14094 -0.09939 -0.09939 -0.09939  0.90061 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.14094    0.02459   5.732 1.02e-08 ***
## is_us1      -0.04155    0.02480  -1.675   0.0939 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3001 on 8669 degrees of freedom
## Multiple R-squared:  0.0003237,  Adjusted R-squared:  0.0002083 
## F-statistic: 2.807 on 1 and 8669 DF,  p-value: 0.09391

4.1.22 Stylized

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5394 -0.5394  0.4606  0.4606  0.4832 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.51678    0.04084   12.65   <2e-16 ***
## is_us1       0.02265    0.04120    0.55    0.582    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4985 on 8669 degrees of freedom
## Multiple R-squared:  3.487e-05,  Adjusted R-squared:  -8.048e-05 
## F-statistic: 0.3023 on 1 and 8669 DF,  p-value: 0.5825

4.1.23 Motivation: Action

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5419 -0.5419  0.4581  0.4581  0.6067 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.39333    0.04067   9.671  < 2e-16 ***
## is_us1       0.14860    0.04103   3.622 0.000294 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4981 on 8649 degrees of freedom
## Multiple R-squared:  0.001514,   Adjusted R-squared:  0.001399 
## F-statistic: 13.12 on 1 and 8649 DF,  p-value: 0.0002941

4.1.24 Motivation: Social

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4985 -0.4985 -0.3467  0.5015  0.6533 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.34667    0.04080   8.498  < 2e-16 ***
## is_us1       0.15186    0.04115   3.690 0.000226 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4996 on 8649 degrees of freedom
## Multiple R-squared:  0.001572,   Adjusted R-squared:  0.001456 
## F-statistic: 13.62 on 1 and 8649 DF,  p-value: 0.0002256

4.1.25 Motivation: Mastery

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4607 -0.4607 -0.4607  0.5393  0.5800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.42000    0.04070   10.32   <2e-16 ***
## is_us1       0.04065    0.04105    0.99    0.322    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4984 on 8649 degrees of freedom
## Multiple R-squared:  0.0001134,  Adjusted R-squared:  -2.252e-06 
## F-statistic: 0.9805 on 1 and 8649 DF,  p-value: 0.3221

4.1.26 Motivation: Achievement

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5292 -0.5292  0.4708  0.4708  0.5000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.50000    0.04076  12.267   <2e-16 ***
## is_us1       0.02923    0.04112   0.711    0.477    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4992 on 8649 degrees of freedom
## Multiple R-squared:  5.843e-05,  Adjusted R-squared:  -5.718e-05 
## F-statistic: 0.5054 on 1 and 8649 DF,  p-value: 0.4772

4.1.27 Motivation: Immersion

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4948 -0.4948 -0.3733  0.5052  0.6267 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.37333    0.04080   9.149  < 2e-16 ***
## is_us1       0.12143    0.04116   2.950  0.00319 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4998 on 8649 degrees of freedom
## Multiple R-squared:  0.001005,   Adjusted R-squared:  0.0008897 
## F-statistic: 8.703 on 1 and 8649 DF,  p-value: 0.003186

4.1.28 Motivation: Creativity

## 
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4567 -0.4567 -0.4567  0.5433  0.5733 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.42667    0.04067  10.491   <2e-16 ***
## is_us1       0.02999    0.04103   0.731    0.465    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4981 on 8649 degrees of freedom
## Multiple R-squared:  6.175e-05,  Adjusted R-squared:  -5.386e-05 
## F-statistic: 0.5341 on 1 and 8649 DF,  p-value: 0.4649

4.2 To predict the probability of having a game label in a certain country, we use multilevel regression and post-stratification (MRP).

# Import libraries
source("read.R")
library(ggalt)
library(reshape2)

options(dplyr.summarise.inform = FALSE)

In this subsection, we try to predict a probability of labeling a certain game with a particular Genome label by different countries.

# read csv
countryLevelDf <- read_csv("demographic.csv")
## Rows: 7167 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country, AgeGroup, Gender
## dbl (1): UserCount
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
countryLevelDf$UserCount <- as.numeric(countryLevelDf$UserCount)

# make the country name consistent with the individual survey dataframe
countryLevelDf <- countryLevelDf %>% 
  filter(Country %in% c("Argentina", "Brazil", "Chile", "Colombia", "Germany", "Greece", "India", "Japan", "Korea", "Mexico", "Nigeria", "Poland", "Saudi Arabia", "Singapore", "South Africa", "United States")) %>%
  dplyr::rename(ageCut = AgeGroup, country = Country, gender = Gender) %>%
  mutate(ageCut = case_when(ageCut == "18 - 24" ~ "18-24",
                                 ageCut == "25 - 34" ~ "25-34",
                                 ageCut == "35 - 44" ~ "35-44",
                                 ageCut == "45 - 55" ~ "45-55",
                                 ageCut == "> 55" ~ "55+"))
countryLevelDf$country[countryLevelDf$country == "United States"] <- "US"
countryLevelDf$country[countryLevelDf$country == "Saudi Arabia"] <- "Saudi.Arabia"
countryLevelDf$country[countryLevelDf$country == "South Africa"] <- "South.Africa"

countryLevelDf <- countryLevelDf %>% mutate(region = case_when(country %in% c("Japan", "Korea") ~ "Confucian",
                            country %in% c("Singapore", "India") ~ "West.South.Asia",
                            country %in% c("Argentina", "Chile", "Colombia", "Mexico", "Brazil") ~ "Latin.America",
                            country %in% c("Germany") ~ "Protestant.Europe",
                            country %in% c("South.Africa", "Nigeria", "Saudi.Arabia") ~ "Islamic",
                            country %in% c("Poland") ~ "Catholic.Europe",
                            country %in% c("Greece") ~ "Orthodox.Europe",
                            country %in% c("US") ~ "US"))

countryLevelDf$country <- factor(countryLevelDf$country)
countryLevelDf$ageCut <- factor(countryLevelDf$ageCut, c("18-24","25-34", "35-44", "45-55", "55+"))
countryLevelDf$gender <- factor(countryLevelDf$gender)
countryLevelDf$region = factor(countryLevelDf$region)
# Calculate the percentage of age+gender group by country
countryLevelDf <- countryLevelDf %>% group_by(country) %>% 
  mutate(UserCount.per = UserCount/sum(UserCount, na.rm=TRUE))
countryLevelDf <- countryLevelDf %>% filter(ageCut != "" & ageCut != "Unknown" & !is.na(ageCut)) %>%
  filter(gender != "" & gender != "Unknown" & !is.na(gender)) %>%
  mutate(gender = ifelse(gender == "Male", "Male", "Female or Nonbinary"))

4.3 Exploratory Data Analysis

We examine the difference between our sample of ~5000 participants and the true MS population.

source("read.R")
df <- df %>% mutate(gender = ifelse(gender == "Male", "Male", "Female or Nonbinary"))
age_sample <- df %>%  
  dplyr::mutate(age = factor(ageCut, ordered = FALSE)) %>% 
  dplyr::group_by(age) %>% 
  dplyr::summarise(n = n()) %>% 
  dplyr::mutate(Sample = n/sum(n))

age_post <- countryLevelDf %>% 
  dplyr::mutate(age = factor(ageCut, ordered = FALSE)) %>% 
  dplyr::group_by(age) %>% 
  dplyr::summarise(n_post = sum(UserCount)) %>% 
  dplyr::mutate(Population = n_post/sum(n_post))
age <- dplyr::inner_join(age_sample, age_post, by = "age") %>% select(age, Sample, Population)

age_plot <- ggplot() + 
  ylab("") + xlab("Proportion") + theme_bw() + coord_flip()  + 
  geom_dumbbell(data = age, aes(y = age, x = Sample, xend = Population)) +
  geom_point(data = melt(age, id = "age"), aes(y = age, x = value, color = variable), size = 2) +
  scale_x_continuous(limits = c(0, 0.5), breaks = c(0, .1, .2, .3, .4, .5)) + 
  ggtitle("Age")
age_plot

# Gender because the poststratification dataset does not have nonbinary 
male_sample <- df %>% 
  dplyr::mutate(gender = ifelse(gender != "Male", "Female or Nonbinary", "Male")) %>%
  dplyr::group_by(gender) %>% 
  dplyr::summarise(n = n()) %>% 
  dplyr::mutate(Sample = n/sum(n))
male_post <- countryLevelDf %>% 
  dplyr::group_by(gender) %>% 
  dplyr::summarise(n_post = sum(UserCount)) %>% 
  dplyr::mutate(Population = n_post/sum(n_post))
male <- dplyr::inner_join(male_sample, male_post, by = "gender") %>% select(gender, Sample, Population)
male_plot <- ggplot() + 
  ylab("") + xlab("") + theme_bw() + coord_flip()  + 
  geom_dumbbell(data = male, aes(y = gender, x = Sample, xend = Population)) +
  geom_point(data = melt(male, id = "gender"), aes(y = gender, x = value, color = variable), size = 2) +
  scale_x_continuous(limits = c(0, 1), breaks = c(0, .2, .4, .6, .8, 1.0)) + ggtitle("Gender")
male_plot

4.3.1 Poststratification for country

  • Here we further show that for the US, it is way higher likely that these games such as PUBG, Elden Ring, GTA, Call of Duty to be violent.

4.3.1.1 TODO: Add the hardcode percentage before merging the table

library(data.table)
poststratify <- function(game, label) {
  temp <- df %>% filter(label == {{ label }} & game == {{ game }}) %>% 
    filter(accessibility %in% c("Yes", "No")) %>% 
    mutate(gender = ifelse(gender == "Male", "Male", "Female or Nonbinary")) %>%
    na.omit()

  mod <- glm(answer ~ hardcore + accessibility + country + gender + ageCut, data = temp, family = binomial(link="logit"))
  post <- expand.grid(accessibility=unique(temp$accessibility), 
                      country=unique(temp$country), 
                      gender=unique(temp$gender), 
                      ageCut=unique(temp$ageCut),
                      hardcore=unique(temp$hardcore))

  post <- post %>% 
    mutate(gender = ifelse(gender != "Male", "Female", "Male")) %>% 
    left_join(countryLevelDf, by=c("country", "gender", "ageCut")) %>%
    mutate(gender = ifelse(gender != "Male", "Female or Nonbinary", "Male"))
  
  post <- post %>% 
    mutate(UserCount.per = ifelse(accessibility == "No", UserCount.per * 0.1, UserCount.per * 0.9))  # heuristic about accessibility 
    
  post$prediction <- predict(mod, newdata=post, type="response", allow.new.levels=TRUE)
  post$weight.pred <- post$prediction*post$UserCount.per 
  results <- data.table(post)[ , .(final.est = sum(weight.pred, na.rm=TRUE)), by = .(country)]
  
  return(results)
}
labelLst = c("pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space", "heroic", "real.world", "violent", 
             "action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn", 
             "stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
gameLst <- unique(df$game)
predicted.df <- data.frame(country=c(), final.est=c(), label=c(), game = c())
# Calculate the poststratified probability
for(g in gameLst) {
  for(l in labelLst) {
    label.df <- poststratify(g, l)
    label.df$label <- l
    label.df$game <- g
    predicted.df <- rbind(predicted.df, label.df)
  }
}
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# View the countries with a probability > 0.75
predicted.df %>% filter(final.est > 0.5 & label == "violent")
##          country final.est   label                   game
##  1:           US 0.7451460 violent                   PUBG
##  2:       Greece 0.5973012 violent                   PUBG
##  3:        Japan 0.5699150 violent                   PUBG
##  4:       Mexico 0.5149857 violent                   PUBG
##  5:       Poland 0.5240003 violent                   PUBG
##  6:           US 0.8926281 violent     Grand Theft Auto V
##  7:    Singapore 0.6421735 violent     Grand Theft Auto V
##  8: South.Africa 0.5146014 violent     Grand Theft Auto V
##  9:       Brazil 0.5838548 violent     Grand Theft Auto V
## 10:        Chile 0.5220274 violent     Grand Theft Auto V
## 11:     Colombia 0.5268495 violent     Grand Theft Auto V
## 12:      Germany 0.7017908 violent     Grand Theft Auto V
## 13:       Greece 0.7147602 violent     Grand Theft Auto V
## 14:        Japan 0.5895735 violent     Grand Theft Auto V
## 15:        Korea 0.6300997 violent     Grand Theft Auto V
## 16:       Mexico 0.6231587 violent     Grand Theft Auto V
## 17:       Poland 0.5453907 violent     Grand Theft Auto V
## 18:           US 0.8339568 violent             Elden Ring
## 19:    Singapore 0.5073604 violent             Elden Ring
## 20:      Germany 0.5004755 violent             Elden Ring
## 21:       Greece 0.5778392 violent             Elden Ring
## 22:       Poland 0.5420260 violent             Elden Ring
## 23:           US 0.6905598 violent Call of Duty: Vanguard
## 24:      Germany 0.5635139 violent Call of Duty: Vanguard
## 25:       Greece 0.5231360 violent Call of Duty: Vanguard
##          country final.est   label                   game

4.3.2 Alternative approach using rstanarm

https://bookdown.org/jl5522/MRP-case-studies/introduction-to-mrp.html

# res <- df %>% filter(label == "made.for.kids") %>%
#   filter(!label %in% c("NA.positive.opinion", "NA.negative.opinion", "NA.feeling", "NA.art")) %>%
#   filter(!is.na(country) & !is.na(gender) & !is.na(Language) & !is.na(ageCut)) %>%
#   mutate(male = ifelse(gender == "Male", 1, 0)) %>% 
#   mutate(male = as.factor(male)) %>%
#   select(Response.ID, country, male, ageCut, region, answer)
# library(rstanarm)
# fit <- stan_glmer(answer ~  (1|country) + (1|ageCut) + male + region,
#   family = binomial(link = "logit"),
#   data = res,
#   prior = normal(0, 1, autoscale = TRUE),
#   prior_covariance = decov(scale = 0.50),
#   adapt_delta = 0.99,
#   refresh = 0,
#   seed = 1010)

4.4 Poststratification for cultural correlation analysis (comparing label correlation and Hofstede’s correlation)

library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car

Below, I will show a Multi-level regression and poststratification (MRP) on one label (violent) and one game (Grand Theft Auto V) as an example. I will use the same approach to generate a prediction for all label-game pair for different countries. In the end, I will use the predicted value as element in the label vector to compare with Hofstede’s cultural dimensions.

4.4.1 MPR Example

4.4.1.1 Prep the groundtruth dataset

# First we need to prep the groundtruth dataset when merging the groundtruth countryLevelDf for post-stratification
# Make the dataset from a long format to a wide format
wide.df <- countryLevelDf %>% dplyr::select(country, gender, ageCut, UserCount) %>%
  pivot_wider(names_from = country, values_from = UserCount) 
wide.df <- wide.df[with(wide.df, order(gender, ageCut)),]
wide.df <- wide.df %>% dplyr::select(-ageCut, -gender)

# Ascertain the order of the random variables
country.order <- colnames(wide.df)
age.order <- unique(df$ageCut[!is.na(df$ageCut)])
gender.order <- unique(df$gender[!is.na(df$gender)])
# Filter a smaller dataset and generate the model
violent <- df %>% filter(label == "violent" & game == "Grand Theft Auto V") 
mod <- glmer(answer ~ 1 + (1|gender) + (1|ageCut) + (1|country), data=violent, family = binomial("probit"))

4.4.1.2 Generate Prediction for ideal types (5 ageCuts * 2 genders * 16 countries)

# We have 10 individual level (5 ageCut * 2 gender) living in 16 countries. There are 10*16 specific ideal types. 
# Reference of the implementation: https://github.com/lleemann/MrP_chapter/blob/master/MrP_Illsutration.pdf
re.gender.coef <- ranef(mod)$gender
re.age.coef <- ranef(mod)$ageCut
re.country.coef <- ranef(mod)$country

rownames(re.gender.coef) <- gender.order
rownames(re.age.coef) <- age.order
rownames(re.country.coef) <- country.order

re.gender <- re.gender.coef[[1]]
re.ageCut <- re.age.coef[[1]]
re.country <- re.country.coef[[1]]

gender.re <- rep(re.gender, 5)
age.re <- rep(kronecker(re.ageCut, c(1)), 1)
ind.re <- rowSums(cbind(gender.re, age.re))
ind.re <- ind.re + fixef(mod)
y.lat <- rep(NA, 16*10)
for(i in 1:16) {
  a <- ((i-1)*10)+1
  b <- a + 9
  y.lat[a:b] <- ind.re + re.country[i]
}
# Here we use pnorm to transform the scores in y.lat1 to predicted probability 
# In other words, what we have doe was on the latent variable, but we are really interested in the predicted probability
p1 <- pnorm(y.lat)  
# p1 is the vector for 10 individual levels in 16 countries
# It's essentially the random effect coefficients
length(p1)  
## [1] 160

4.4.1.3 Post-stratify by ideal types

# Next, we can post-stratify and make predictions
a <- c(data.matrix(wide.df))  # get the number of groudtruth user count in the 160 categories above

# Post-stratified number for a label for a game (in this case "violent", GTA V)
# This number shows the prediction regardless of countries: pretty consistent hah
sum(p1*a)/sum(a)
## [1] 0.7234758
# Now we post-stratify the individual levels (5 ageCut * 2 gender) for each country
country.pred <- rep(NA, 16) 
for(i in 1:16) {
  # a1, a2 are indices 
  a1 <- ((i-1)*10) + 1
  a2 <- a1 + 9
  p1 <- pnorm(y.lat[a1:a2])
  a <- wide.df[,i]  # We don't need to worry about the order because we have considered the order when we generate y.lat1
  country.pred[i] <- sum(p1*a) / sum(a)
}

# Chile is really not thinking that GTA is a violent game
plot.df <- data.frame(country = factor(colnames(wide.df), levels=colnames(wide.df)), predicted=country.pred)
ggplot(aes(x=reorder(country, predicted), y=predicted), data=plot.df) +
  geom_point()

4.4.2 MRP for all label-game pairs for each country

The goal here is to create a dataframe with columns: label, game, country, glm_pred, mrp_adjusted

# Let's consider 0-1 vector first
labels <- c("pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space", "heroic", "real.world", "violent", "action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn", "stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
games <- unique(df$game)

# This function repeats what we have done in the example above
getMRPPrediction <- function(mod) {
  re.gender.coef <- ranef(mod)$gender
  re.age.coef <- ranef(mod)$ageCut
  re.country.coef <- ranef(mod)$country
  
  rownames(re.gender.coef) <- gender.order
  rownames(re.age.coef) <- age.order
  rownames(re.country.coef) <- country.order
  
  re.gender <- re.gender.coef[[1]]
  re.ageCut <- re.age.coef[[1]]
  re.country <- re.country.coef[[1]]

  gender.re <- rep(re.gender, 5)
  age.re <- rep(kronecker(re.ageCut, c(1)), 1)
  ind.re <- rowSums(cbind(gender.re, age.re))
  ind.re <- ind.re + fixef(mod)
  y.lat <- rep(NA, 16*10)
  
  for(i in 1:16) {
    a <- ((i-1)*10)+1
    b <- a + 9
    y.lat[a:b] <- ind.re + re.country[i]
  }
  p1 <- pnorm(y.lat)
 
  # post-stratification prediction 
  country.pred <- rep(NA, 16)
  for(i in 1:16) {
    a1 <- ((i-1)*10) + 1
    a2 <- a1 + 9
    p1 <- pnorm(y.lat[a1:a2])
    a <- wide.df[,i]
    country.pred[i] <- sum(p1*a) / sum(a)
  }
  
  # append MRP prediction 
  output.df <- data.frame(country = factor(colnames(wide.df), levels=country.order), mrp_pred=country.pred)
  return(output.df)
}


getGLMPrediction = function(mod) {
  pred <- expand.grid(country=country.order)
  pred$glm_pred <- predict(mod, newdata=pred, type="response", allow.new.levels=TRUE)
  return(pred)
}

# Here we create the final.df which is the post-stratified prediction for each country
final.df <- data.frame(label=c(), country=c(), game=c(), glm_pred=c(), mrp_adjusted=c())
for(label in labels){
  for(game in games){
    label.df <- df %>% filter(label == {{ label }} & game == {{ game }})
    mod.glm <- glm(answer ~ country, data=label.df, family = binomial(link="logit"))
    mod.mrp <- glmer(answer ~ 1 + (1|gender) + (1|ageCut) + (1|country), data=label.df, family = binomial("probit"))    
    
    glm.pred <- getGLMPrediction(mod.glm)
    mrp.pred <- getMRPPrediction(mod.mrp)
    
    mrp.pred <- mrp.pred %>% left_join(glm.pred, by="country")
    mrp.pred$game <- game
    mrp.pred$label <- label
    final.df <- rbind(mrp.pred, final.df)
  }
}
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
head(final.df)
##        country  mrp_pred  glm_pred       game      label
## 1 South.Africa 0.8286466 0.7692308 The Sims 3 creativity
## 2     Colombia 0.8376679 0.7500000 The Sims 3 creativity
## 3       Greece 0.8194868 0.8666667 The Sims 3 creativity
## 4        Korea 0.8249671 0.9999999 The Sims 3 creativity
## 5        India 0.8325641 0.9999999 The Sims 3 creativity
## 6        Japan 0.8127111 0.6250000 The Sims 3 creativity
## We need to filter out country with very low survey responses
filter.df <- df %>% group_by(label, country, game) %>% 
    filter(n() >= 5) %>%
    count()

final.df <- final.df %>% 
  left_join(filter.df, by=c("country", "game", "label")) %>%
  mutate(mrp_pred=ifelse(n < 5, NA, final.df$mrp_pred)) %>%
  mutate(glm_pred=ifelse(n < 5, NA, final.df$glm_pred))
head(final.df)
##        country  mrp_pred  glm_pred       game      label  n
## 1 South.Africa 0.8286466 0.7692308 The Sims 3 creativity 13
## 2     Colombia 0.8376679 0.7500000 The Sims 3 creativity 12
## 3       Greece 0.8194868 0.8666667 The Sims 3 creativity 15
## 4        Korea 0.8249671 0.9999999 The Sims 3 creativity  6
## 5        India        NA        NA The Sims 3 creativity NA
## 6        Japan 0.8127111 0.6250000 The Sims 3 creativity  8

4.4.3 Heatmap

# The post-stratified prediction
final.df %>% ggplot(aes(x=country, y=label, fill=mrp_pred)) +
  geom_tile() +
  facet_wrap(~ game, scales="free", ncol=3)

4.4.4 Comparing post-stratified results with the Hofstede’s Dimensions

Now we have the post-stratified results, we should compare with Hoftstede Cultural Dimensions.

library(reshape)
library(coop)

# helper 
# https://stats.stackexchange.com/questions/31565/compute-a-cosine-dissimilarity-matrix-in-r
# cos.sim <- function(ix) {
#     A = X[ix[1],]
#     B = X[ix[2],]
#     return( sum(A*B)/sqrt(sum(A^2)*sum(B^2)) )
# }  

euclidean <- function(a, b) sqrt(sum((a - b)^2))
# We create a dataframe where we calculate the cosine similarity between two countries
hofstedeDf <- read.csv("hoftstede.csv")
hofstedeDf <- hofstedeDf %>% select(-X) %>%
  mutate(power.distance = power.distance/100,
         individualism = individualism/100,
         masculinity = masculinity/100,
         uncertainty = uncertainty/100, 
         long.term.orientation = long.term.orientation/100,
         indulgence = indulgence/100)  # scale the hofstede score to 0 - 1

# Create a matrix with cosine similarity between countries
hofstedeMatrix = as.matrix(hofstedeDf[,2:7])
m <- dist(hofstedeMatrix, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)  # switch to Euclidean distance
m <- as.matrix(m, nrow=16, ncol=16)
rownames(m) <- hofstedeDf$country
colnames(m) <- hofstedeDf$country
cultureDf <- data.frame(countryA=rownames(m)[row(m)], countryB=colnames(m)[col(m)], culture.euclidean=c(m))
head(cultureDf)
##       countryA countryB culture.euclidean
## 1        India    India         0.0000000
## 2    Singapore    India         0.5605355
## 3           US    India         0.8330066
## 4      Nigeria    India         0.8075890
## 5 South.Africa    India         0.6466065
## 6    Argentina    India         0.7785242
# This function takes a matrix where columns are countries
# This function returns the correlation between countries
# countryA, countryB, label.corr
calcCosDf <- function(full, label) {
  full
  label
  # mx <- cor(full, use = "pairwise.complete.obs")
  mx <- as.matrix(full)
  m <- cosine(mx)
 
  retDf <- data.frame(countryA=rownames(m)[row(m)], countryB=colnames(m)[col(m)], label.cos.sim=c(m))
  retDf$label <- label

  return(retDf)
}

# Euclidean distance calculation
calEuclidean <- function(full, label) {
  full
  label
  # mx <- cor(full, use = "pairwise.complete.obs")
  mx <- t(as.matrix(full))
  m <- dist(mx, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)
  m <- as.matrix(m, nrow=16, ncol=16)
 
  retDf <- data.frame(countryA=rownames(m)[row(m)], countryB=colnames(m)[col(m)], label.euclidean=c(m))
  retDf$label <- label

  return(retDf)
}
# This is calculating the distance between game vectors (each dot is represented by game)
# labels currently using the binary indicator variable
countries <- unique(df$country)
gameOrder <- unique(df$game[!is.na(df$game)])
output <- data.frame(countryA=c(), countryB=c(), label.cos.sim=c(), label=c())

for(l in labels) {
  labelSpace <- data.frame(matrix(ncol = 0, nrow=11))
  for(country in countries) {
    dfA <- final.df %>% filter(country == !!country & label == !!l) 
    dfA <- dfA %>% arrange(factor(game, levels=gameOrder))
    v <- as.vector(dfA$mrp_pred)
    labelSpace <- cbind(labelSpace, as.data.frame(v))
    names(labelSpace)[names(labelSpace) == "v"] <- country
  }
  output <- rbind(calEuclidean(labelSpace, l), output)
}
# This is calculating the distance between label vectors (each dot is represented by labels)
# labels currently using the binary indicator variable
countries <- unique(df$country)
gameOrder <- unique(df$game[!is.na(df$game)])
labelOrder <- c("pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space", "heroic", "real.world", "violent", "action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn", "stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
output.game <- data.frame(countryA=c(), countryB=c(), label.cos.sim=c(), label=c())

for(game in gameOrder) {
  gameSpace <- data.frame(matrix(ncol=0, nrow=24))
  for(country in countries) {
    dfB <- final.df %>% filter(country == !!country & game == !!game) 
    
    dfB <- dfB %>% arrange(factor(label, levels=labelOrder))
    v <- dfB$mrp_pred
    gameSpace <- cbind(gameSpace, as.data.frame(v))
    names(gameSpace)[names(gameSpace) == "v"] <- country
  }
  output.game <- rbind(calEuclidean(gameSpace, game))
}
output.game
##         countryA     countryB label.euclidean      label
## 1             US           US      0.00000000 The Sims 3
## 2          India           US              NA The Sims 3
## 3      Singapore           US              NA The Sims 3
## 4   South.Africa           US      0.22505351 The Sims 3
## 5        Nigeria           US      0.12879605 The Sims 3
## 6      Argentina           US      0.13634817 The Sims 3
## 7         Brazil           US      0.24325714 The Sims 3
## 8          Chile           US      0.16535898 The Sims 3
## 9       Colombia           US      0.34438118 The Sims 3
## 10       Germany           US      0.14137107 The Sims 3
## 11        Greece           US      0.22707679 The Sims 3
## 12         Japan           US      0.30844855 The Sims 3
## 13         Korea           US      0.16904016 The Sims 3
## 14        Mexico           US      0.22531273 The Sims 3
## 15        Poland           US      0.21986125 The Sims 3
## 16  Saudi.Arabia           US              NA The Sims 3
## 17            US        India              NA The Sims 3
## 18         India        India      0.00000000 The Sims 3
## 19     Singapore        India              NA The Sims 3
## 20  South.Africa        India              NA The Sims 3
## 21       Nigeria        India              NA The Sims 3
## 22     Argentina        India              NA The Sims 3
## 23        Brazil        India              NA The Sims 3
## 24         Chile        India              NA The Sims 3
## 25      Colombia        India              NA The Sims 3
## 26       Germany        India              NA The Sims 3
## 27        Greece        India              NA The Sims 3
## 28         Japan        India              NA The Sims 3
## 29         Korea        India              NA The Sims 3
## 30        Mexico        India              NA The Sims 3
## 31        Poland        India              NA The Sims 3
## 32  Saudi.Arabia        India              NA The Sims 3
## 33            US    Singapore              NA The Sims 3
## 34         India    Singapore              NA The Sims 3
## 35     Singapore    Singapore      0.00000000 The Sims 3
## 36  South.Africa    Singapore              NA The Sims 3
## 37       Nigeria    Singapore              NA The Sims 3
## 38     Argentina    Singapore              NA The Sims 3
## 39        Brazil    Singapore              NA The Sims 3
## 40         Chile    Singapore              NA The Sims 3
## 41      Colombia    Singapore              NA The Sims 3
## 42       Germany    Singapore              NA The Sims 3
## 43        Greece    Singapore              NA The Sims 3
## 44         Japan    Singapore              NA The Sims 3
## 45         Korea    Singapore              NA The Sims 3
## 46        Mexico    Singapore              NA The Sims 3
## 47        Poland    Singapore              NA The Sims 3
## 48  Saudi.Arabia    Singapore              NA The Sims 3
## 49            US South.Africa      0.22505351 The Sims 3
## 50         India South.Africa              NA The Sims 3
## 51     Singapore South.Africa              NA The Sims 3
## 52  South.Africa South.Africa      0.00000000 The Sims 3
## 53       Nigeria South.Africa      0.28953504 The Sims 3
## 54     Argentina South.Africa      0.25067356 The Sims 3
## 55        Brazil South.Africa      0.28624684 The Sims 3
## 56         Chile South.Africa      0.28827462 The Sims 3
## 57      Colombia South.Africa      0.45228434 The Sims 3
## 58       Germany South.Africa      0.27667318 The Sims 3
## 59        Greece South.Africa      0.34348600 The Sims 3
## 60         Japan South.Africa      0.40511123 The Sims 3
## 61         Korea South.Africa      0.33625561 The Sims 3
## 62        Mexico South.Africa      0.35766063 The Sims 3
## 63        Poland South.Africa      0.31539808 The Sims 3
## 64  Saudi.Arabia South.Africa              NA The Sims 3
## 65            US      Nigeria      0.12879605 The Sims 3
## 66         India      Nigeria              NA The Sims 3
## 67     Singapore      Nigeria              NA The Sims 3
## 68  South.Africa      Nigeria      0.28953504 The Sims 3
## 69       Nigeria      Nigeria      0.00000000 The Sims 3
## 70     Argentina      Nigeria      0.09978512 The Sims 3
## 71        Brazil      Nigeria      0.23442456 The Sims 3
## 72         Chile      Nigeria      0.12017789 The Sims 3
## 73      Colombia      Nigeria      0.23080710 The Sims 3
## 74       Germany      Nigeria      0.14342275 The Sims 3
## 75        Greece      Nigeria      0.15614066 The Sims 3
## 76         Japan      Nigeria      0.35181560 The Sims 3
## 77         Korea      Nigeria      0.12211008 The Sims 3
## 78        Mexico      Nigeria      0.18390054 The Sims 3
## 79        Poland      Nigeria      0.14279187 The Sims 3
## 80  Saudi.Arabia      Nigeria              NA The Sims 3
## 81            US    Argentina      0.13634817 The Sims 3
## 82         India    Argentina              NA The Sims 3
## 83     Singapore    Argentina              NA The Sims 3
## 84  South.Africa    Argentina      0.25067356 The Sims 3
## 85       Nigeria    Argentina      0.09978512 The Sims 3
## 86     Argentina    Argentina      0.00000000 The Sims 3
## 87        Brazil    Argentina      0.20612227 The Sims 3
## 88         Chile    Argentina      0.10455169 The Sims 3
## 89      Colombia    Argentina      0.27517836 The Sims 3
## 90       Germany    Argentina      0.15761274 The Sims 3
## 91        Greece    Argentina      0.17507426 The Sims 3
## 92         Japan    Argentina      0.31166698 The Sims 3
## 93         Korea    Argentina      0.20616144 The Sims 3
## 94        Mexico    Argentina      0.18548770 The Sims 3
## 95        Poland    Argentina      0.15434329 The Sims 3
## 96  Saudi.Arabia    Argentina              NA The Sims 3
## 97            US       Brazil      0.24325714 The Sims 3
## 98         India       Brazil              NA The Sims 3
## 99     Singapore       Brazil              NA The Sims 3
## 100 South.Africa       Brazil      0.28624684 The Sims 3
## 101      Nigeria       Brazil      0.23442456 The Sims 3
## 102    Argentina       Brazil      0.20612227 The Sims 3
## 103       Brazil       Brazil      0.00000000 The Sims 3
## 104        Chile       Brazil      0.20585476 The Sims 3
## 105     Colombia       Brazil      0.38811534 The Sims 3
## 106      Germany       Brazil      0.20194577 The Sims 3
## 107       Greece       Brazil      0.24443416 The Sims 3
## 108        Japan       Brazil      0.39765705 The Sims 3
## 109        Korea       Brazil      0.29212015 The Sims 3
## 110       Mexico       Brazil      0.35400213 The Sims 3
## 111       Poland       Brazil      0.21170219 The Sims 3
## 112 Saudi.Arabia       Brazil              NA The Sims 3
## 113           US        Chile      0.16535898 The Sims 3
## 114        India        Chile              NA The Sims 3
## 115    Singapore        Chile              NA The Sims 3
## 116 South.Africa        Chile      0.28827462 The Sims 3
## 117      Nigeria        Chile      0.12017789 The Sims 3
## 118    Argentina        Chile      0.10455169 The Sims 3
## 119       Brazil        Chile      0.20585476 The Sims 3
## 120        Chile        Chile      0.00000000 The Sims 3
## 121     Colombia        Chile      0.27053315 The Sims 3
## 122      Germany        Chile      0.18224458 The Sims 3
## 123       Greece        Chile      0.16543564 The Sims 3
## 124        Japan        Chile      0.37104157 The Sims 3
## 125        Korea        Chile      0.21596757 The Sims 3
## 126       Mexico        Chile      0.24034773 The Sims 3
## 127       Poland        Chile      0.16054306 The Sims 3
## 128 Saudi.Arabia        Chile              NA The Sims 3
## 129           US     Colombia      0.34438118 The Sims 3
## 130        India     Colombia              NA The Sims 3
## 131    Singapore     Colombia              NA The Sims 3
## 132 South.Africa     Colombia      0.45228434 The Sims 3
## 133      Nigeria     Colombia      0.23080710 The Sims 3
## 134    Argentina     Colombia      0.27517836 The Sims 3
## 135       Brazil     Colombia      0.38811534 The Sims 3
## 136        Chile     Colombia      0.27053315 The Sims 3
## 137     Colombia     Colombia      0.00000000 The Sims 3
## 138      Germany     Colombia      0.34420513 The Sims 3
## 139       Greece     Colombia      0.29060349 The Sims 3
## 140        Japan     Colombia      0.51195050 The Sims 3
## 141        Korea     Colombia      0.27418658 The Sims 3
## 142       Mexico     Colombia      0.26442587 The Sims 3
## 143       Poland     Colombia      0.28583090 The Sims 3
## 144 Saudi.Arabia     Colombia              NA The Sims 3
## 145           US      Germany      0.14137107 The Sims 3
## 146        India      Germany              NA The Sims 3
## 147    Singapore      Germany              NA The Sims 3
## 148 South.Africa      Germany      0.27667318 The Sims 3
## 149      Nigeria      Germany      0.14342275 The Sims 3
## 150    Argentina      Germany      0.15761274 The Sims 3
## 151       Brazil      Germany      0.20194577 The Sims 3
## 152        Chile      Germany      0.18224458 The Sims 3
## 153     Colombia      Germany      0.34420513 The Sims 3
## 154      Germany      Germany      0.00000000 The Sims 3
## 155       Greece      Germany      0.22596416 The Sims 3
## 156        Japan      Germany      0.29361145 The Sims 3
## 157        Korea      Germany      0.17397507 The Sims 3
## 158       Mexico      Germany      0.28334712 The Sims 3
## 159       Poland      Germany      0.17409191 The Sims 3
## 160 Saudi.Arabia      Germany              NA The Sims 3
## 161           US       Greece      0.22707679 The Sims 3
## 162        India       Greece              NA The Sims 3
## 163    Singapore       Greece              NA The Sims 3
## 164 South.Africa       Greece      0.34348600 The Sims 3
## 165      Nigeria       Greece      0.15614066 The Sims 3
## 166    Argentina       Greece      0.17507426 The Sims 3
## 167       Brazil       Greece      0.24443416 The Sims 3
## 168        Chile       Greece      0.16543564 The Sims 3
## 169     Colombia       Greece      0.29060349 The Sims 3
## 170      Germany       Greece      0.22596416 The Sims 3
## 171       Greece       Greece      0.00000000 The Sims 3
## 172        Japan       Greece      0.40630023 The Sims 3
## 173        Korea       Greece      0.20192862 The Sims 3
## 174       Mexico       Greece      0.25495132 The Sims 3
## 175       Poland       Greece      0.12423208 The Sims 3
## 176 Saudi.Arabia       Greece              NA The Sims 3
## 177           US        Japan      0.30844855 The Sims 3
## 178        India        Japan              NA The Sims 3
## 179    Singapore        Japan              NA The Sims 3
## 180 South.Africa        Japan      0.40511123 The Sims 3
## 181      Nigeria        Japan      0.35181560 The Sims 3
## 182    Argentina        Japan      0.31166698 The Sims 3
## 183       Brazil        Japan      0.39765705 The Sims 3
## 184        Chile        Japan      0.37104157 The Sims 3
## 185     Colombia        Japan      0.51195050 The Sims 3
## 186      Germany        Japan      0.29361145 The Sims 3
## 187       Greece        Japan      0.40630023 The Sims 3
## 188        Japan        Japan      0.00000000 The Sims 3
## 189        Korea        Japan      0.37879009 The Sims 3
## 190       Mexico        Japan      0.37186489 The Sims 3
## 191       Poland        Japan      0.39004053 The Sims 3
## 192 Saudi.Arabia        Japan              NA The Sims 3
## 193           US        Korea      0.16904016 The Sims 3
## 194        India        Korea              NA The Sims 3
## 195    Singapore        Korea              NA The Sims 3
## 196 South.Africa        Korea      0.33625561 The Sims 3
## 197      Nigeria        Korea      0.12211008 The Sims 3
## 198    Argentina        Korea      0.20616144 The Sims 3
## 199       Brazil        Korea      0.29212015 The Sims 3
## 200        Chile        Korea      0.21596757 The Sims 3
## 201     Colombia        Korea      0.27418658 The Sims 3
## 202      Germany        Korea      0.17397507 The Sims 3
## 203       Greece        Korea      0.20192862 The Sims 3
## 204        Japan        Korea      0.37879009 The Sims 3
## 205        Korea        Korea      0.00000000 The Sims 3
## 206       Mexico        Korea      0.24795509 The Sims 3
## 207       Poland        Korea      0.20564284 The Sims 3
## 208 Saudi.Arabia        Korea              NA The Sims 3
## 209           US       Mexico      0.22531273 The Sims 3
## 210        India       Mexico              NA The Sims 3
## 211    Singapore       Mexico              NA The Sims 3
## 212 South.Africa       Mexico      0.35766063 The Sims 3
## 213      Nigeria       Mexico      0.18390054 The Sims 3
## 214    Argentina       Mexico      0.18548770 The Sims 3
## 215       Brazil       Mexico      0.35400213 The Sims 3
## 216        Chile       Mexico      0.24034773 The Sims 3
## 217     Colombia       Mexico      0.26442587 The Sims 3
## 218      Germany       Mexico      0.28334712 The Sims 3
## 219       Greece       Mexico      0.25495132 The Sims 3
## 220        Japan       Mexico      0.37186489 The Sims 3
## 221        Korea       Mexico      0.24795509 The Sims 3
## 222       Mexico       Mexico      0.00000000 The Sims 3
## 223       Poland       Mexico      0.26123688 The Sims 3
## 224 Saudi.Arabia       Mexico              NA The Sims 3
## 225           US       Poland      0.21986125 The Sims 3
## 226        India       Poland              NA The Sims 3
## 227    Singapore       Poland              NA The Sims 3
## 228 South.Africa       Poland      0.31539808 The Sims 3
## 229      Nigeria       Poland      0.14279187 The Sims 3
## 230    Argentina       Poland      0.15434329 The Sims 3
## 231       Brazil       Poland      0.21170219 The Sims 3
## 232        Chile       Poland      0.16054306 The Sims 3
## 233     Colombia       Poland      0.28583090 The Sims 3
## 234      Germany       Poland      0.17409191 The Sims 3
## 235       Greece       Poland      0.12423208 The Sims 3
## 236        Japan       Poland      0.39004053 The Sims 3
## 237        Korea       Poland      0.20564284 The Sims 3
## 238       Mexico       Poland      0.26123688 The Sims 3
## 239       Poland       Poland      0.00000000 The Sims 3
## 240 Saudi.Arabia       Poland              NA The Sims 3
## 241           US Saudi.Arabia              NA The Sims 3
## 242        India Saudi.Arabia              NA The Sims 3
## 243    Singapore Saudi.Arabia              NA The Sims 3
## 244 South.Africa Saudi.Arabia              NA The Sims 3
## 245      Nigeria Saudi.Arabia              NA The Sims 3
## 246    Argentina Saudi.Arabia              NA The Sims 3
## 247       Brazil Saudi.Arabia              NA The Sims 3
## 248        Chile Saudi.Arabia              NA The Sims 3
## 249     Colombia Saudi.Arabia              NA The Sims 3
## 250      Germany Saudi.Arabia              NA The Sims 3
## 251       Greece Saudi.Arabia              NA The Sims 3
## 252        Japan Saudi.Arabia              NA The Sims 3
## 253        Korea Saudi.Arabia              NA The Sims 3
## 254       Mexico Saudi.Arabia              NA The Sims 3
## 255       Poland Saudi.Arabia              NA The Sims 3
## 256 Saudi.Arabia Saudi.Arabia      0.00000000 The Sims 3
# Plotting the correlation
combined <- cultureDf %>% left_join(output, by=c("countryA", "countryB")) %>%
  filter(countryA != countryB) %>%
  filter(!duplicated(paste0(pmax(countryA, countryB), pmin(countryA, countryB), label)))

ggscatter(combined, x = "culture.euclidean", y = "label.euclidean", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "spearman",
          xlab = "Hofstede Cosine Similarity", ylab = "Label Cosine Similarity")
## `geom_smooth()` using formula 'y ~ x'

# Plotting the correlation
combined <- cultureDf %>% left_join(output.game, by=c("countryA", "countryB")) %>%
  filter(countryA != countryB) %>%
  filter(!duplicated(paste0(pmax(countryA, countryB), pmin(countryA, countryB), label)))

ggscatter(combined, x = "culture.euclidean", y = "label.euclidean", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "spearman",
          xlab = "Hofstede Euclidean Distance", ylab = "Label Euclidean Distance")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 42 rows containing non-finite values (stat_smooth).
## Warning: Removed 42 rows containing non-finite values (stat_cor).
## Warning: Removed 42 rows containing missing values (geom_point).

X-axis is the Euclidean distance of the Hofstede’s dimension between country pairs. Y-axis is the Euclidean distance of two 11-dimensional vectors between two countries. The 24-dimensional vectors refer to 0-1 binary labels (e.g., [pacifist, violent, …]) for a specific game.

5 Figure 3: If label outcome differences are found across countries, is there a data-backed explanation relating to culture?

To answer this question, we use the Hofstede’s theory of cultural dimensions. This question answers whether culturally “closer” countries have more similar survey results.

  • Define ‘culture’ using Hofstede’s cultural dimension map and calculate cultural similarity scores between pairs of countries
  • Calculate correlations between sampled survey responses for individuals across pairs of countries
library(tidyverse)
library(tidytext)
library(data.table)
library("ggpubr")
source("read.R")

theme_set(
  theme_bw() + 
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size=8))
)
labels <- unique(df$label)
# 1. Find the pairwise correlation between countries based on the Hofstede's framework
# correlation based on six metrics: power.distance, individualism, masculinity, uncertainty
# long.term.orientation, and indulgence
library(reshape)
hofstedeDf <- read.csv("hoftstede.csv")
hofstedeDf <- hofstedeDf %>% select(-X) %>%
  mutate(power.distance = power.distance / 100,
         individualism = individualism / 100,
         masculinity = masculinity / 100,
         uncertainty = uncertainty / 100, 
         long.term.orientation = long.term.orientation / 100,
         indulgence = indulgence / 100)  # scale the hofstede score to 0 - 1
corrMatrix <- cor(t(hofstedeDf[, -1]))  # remove the first column with countries

corrMatrix[upper.tri(corrMatrix, diag = TRUE)] <- NA
rownames(corrMatrix) <- colnames(corrMatrix) <- hofstedeDf$country
corrMatrix <- na.omit(reshape::melt(t(corrMatrix)))
corrMatrix <- corrMatrix[ order(corrMatrix$X1, corrMatrix$X2), ]
hofstedeCorr <- corrMatrix %>% select(X1, X2, value) %>% dplyr::rename(countryA=X1, countryB=X2, hofstede.corr=value)
hofstedeCorr$countryA <- as.character(hofstedeCorr$countryA)
hofstedeCorr$countryB <- as.character(hofstedeCorr$countryB)
hofstedeCorr <- transform(hofstedeCorr, countryA = pmin(countryA, countryB), countryB=pmax(countryA, countryB))
head(hofstedeCorr)
##     countryA     countryB hofstede.corr
## 17     India    Singapore    0.44105595
## 33     India           US   -0.40050181
## 49     India      Nigeria   -0.06628162
## 65     India South.Africa   -0.48927712
## 81 Argentina        India   -0.35943139
## 97    Brazil        India    0.18844511
# Calculate correlations between sampled survey responses for individuals across pairs of countries
rnormt <- function(n, range, mu, s) {
  # range is a vector of two values
  F.a <- pnorm(min(range), mean = mu, sd = s)
  F.b <- pnorm(max(range), mean = mu, sd = s)
  
  u <- runif(n, min = F.a, max = F.b)
  return(qnorm(u, mean = mu, sd = s))
}

countryCorr <- df %>% select(Response.ID, game, country, label, answer)
# Responses that are lower than 5 participants per game 
summary <- countryCorr %>% filter(label == "learning_curve") %>%  
  group_by(country, game) %>%
  summarise(n=n(), mean=mean(answer), sd=sd(answer)) %>%
  ungroup() %>% 
  filter(n < 5)

# games that we will ignore altogether
gameTooFew <- unique(summary$game)
gameTooFew
## [1] "The Sims 3"                    "Animal Crossing: New Horizons"
## [3] "Mario Kart 8"                  "Stardew Valley"               
## [5] "Candy Crush"
# This function takes a matrix where columns are countries
# This function returns the correlation between countries
# countryA, countryB, label.corr
corr.df <- function(full, label) {
  mx <- cor(full, use = "complete.obs")
  mx[upper.tri(mx, diag = TRUE)] <- NA
  rownames(mx) <- colnames(mx)
  mx <- na.omit(reshape::melt(t(mx)))
  mx <- mx[ order(mx$X1, mx$X2), ]
  labelCorr <- mx %>% select(X1, X2, value)
  labelCorr <- labelCorr %>% dplyr::rename(countryA=X1, countryB=X2, label.corr=value)
  labelCorr$countryA <- as.character(labelCorr$countryA)
  labelCorr$countryB <- as.character(labelCorr$countryB)
  labelCorr <- transform(labelCorr, countryA = pmin(countryA, countryB), countryB=pmax(countryA, countryB))
  labelCorr$label <- label
  return(labelCorr)
}

5.1 Game (11) * Label (28) vector correlation vs. Hofstede

Result: All the label correlation are very high, so this might not be what we want

# This step forms a game vector with 11 dimension per label/country with means
gameVector = function(inputrnormtA, inputCountry, inputLabel, lower, upper) {
  inputCountry
  inputLabel
  dfA <- df %>% filter(label == !!inputLabel) %>%
    mutate(answer = (1-0) * (answer-!!lower) / (!!upper-!!lower) + 0) %>%  # the filter and mutate_at order is important and cannot be switched
    filter(country == !!inputCountry) %>% 
    # filter(!game %in% gameTooFew) %>%
    group_by(game) %>%
    dplyr::summarise(mean=mean(answer), sd=sd(answer)) 
  
  filterDf <- df %>% filter(label == !!inputLabel & country == !!inputCountry) %>%
    group_by(game) %>% 
    filter(n() >= 5) %>%
    count()
  dfA <- dfA %>% mutate(mean=ifelse(!game %in% filterDf$game, NA, dfA$mean))
  rnormtA <- inputrnormtA
  
  return(c(rnormtA, c(dfA$mean)))
}
countries <- unique(df$country)
labels <- c("control_complexity", "learning_curve", "difficulty", "replayability", "pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space",
            "heroic", "real.world", "violent", "action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn",
            "stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
output <- data.frame(countryA=c(), countryB=c(), label.corr=c(), label=c())
labelSpace <- data.frame(matrix(ncol=0, nrow=11))

for(label in labels) {
  labelSpace <- data.frame(matrix(ncol=0, nrow=11))
  
  for(country in countries) {
    v <- c()
    if(label %in% c("control_complexity", "learning_curve", "replayability")) {
      v <- gameVector(v, country, label, lower=1, upper=3)
    } else if (label == "difficulty") {
      v <- gameVector(v, country, label, lower=1, upper=4)
    } else {
      v <- gameVector(v, country, label, lower=0, upper=1)
    }
    labelSpace <- cbind(labelSpace, as.data.frame(v))
    names(labelSpace)[names(labelSpace) == "v"] <- country
  }
  output <- rbind(labelSpace, output)
}
# Plot label correlation against the Hofstede scores
# Create a matrix where columns are countryA, countryB, and label.corr
mx <- cor(labelSpace, use = "pairwise.complete.obs")
mx[upper.tri(mx, diag = TRUE)] <- NA
rownames(mx) <- colnames(mx) 
mx <- na.omit(reshape::melt(t(mx)))
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE

## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
mx <- mx[ order(mx$X1, mx$X2), ]
labelCorr <- mx %>% select(X1, X2, value)
labelCorr <- labelCorr %>% dplyr::rename(countryA=X1, countryB=X2, label.corr=value)
labelCorr$countryA <- as.character(labelCorr$countryA)
labelCorr$countryB <- as.character(labelCorr$countryB)
labelCorr <- transform(labelCorr, countryA = pmin(countryA, countryB), countryB=pmax(countryA, countryB))
# Plotting the correlation
combined <- hofstedeCorr %>% left_join(labelCorr, by=c("countryA", "countryB")) 
ggscatter(combined, x = "hofstede.corr", y = "label.corr", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "spearman",
          xlab = "Hofstede", ylab = "Label")
## `geom_smooth()` using formula 'y ~ x'

5.2 Game (11) vector correlation vs. Hofstede

This one compare a 11-dimensional vector between countries for each label. So each dot is a comparison between two countries for a label (e.g., US vs Japan for control_complexity)

updateVector = function(inputCountry, inputLabel, lower=0, upper=1) {
  inputCountry
  inputLabel
  dfA <- df %>% filter(label == {{ inputLabel }}) %>%
    mutate(answer = (1-0) * (answer-lower) / (upper-lower) + 0) %>%  # the filter and mutate_at order is important and cannot be switched
    filter(country == {{ inputCountry }}) %>% 
    filter(!game %in% gameTooFew) %>%
    group_by(game) %>%
    dplyr::summarise(mean=mean(answer, na.rm=TRUE), sd=sd(answer, na.rm=TRUE)) 
  
  rnormtA <- c()
  for(row in 1:nrow(dfA)) {
    meanA <- as.double(dfA[row, "mean"])
    sdA <- as.double(dfA[row, "sd"])
    
    rnormtA <- c(rnormtA, meanA)
  }
  return(rnormtA)
}
countries <- unique(df$country)
labels <- c("control_complexity", "learning_curve", "difficulty", "replayability", "pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space",
            "heroic", "real.world", "violent", "action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn",
             "stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
output <- data.frame(countryA=c(), countryB=c(), label.corr=c(), label=c())
for(label in labels) {
  v <- c()
  labelSpace <- data.frame(matrix(ncol = 0, nrow = 600))
  for(country in countries) {
    if(label %in% c("control_complexity", "learning_curve", "replayability")) {
      v <- updateVector(country, label, lower=1, upper=3)
    } else if (label == "difficulty") {
      v <- updateVector(country, label, lower=1, upper=4)
    } else {
      v <- updateVector(country, label, lower=0, upper=1)
    }
    
    labelSpace <- cbind(labelSpace, as.data.frame(v))
    names(labelSpace)[names(labelSpace) == "v"] <- country
  }
  #break
  output <- rbind(corr.df(labelSpace, label), output)
}
# Now plotting the two correlations 
combined <- hofstedeCorr %>% left_join(output, by=c("countryA", "countryB")) 

ggscatter(combined, x = "hofstede.corr", y = "label.corr", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "spearman",
          xlab = "Hofstede", ylab = "Label")
## `geom_smooth()` using formula 'y ~ x'

5.3 Sampling approach 1

For each country and label, we sample 100 points based on mean and sd. Overall, each dot is a 6-dimensional game vector for each label between two countries. (control_complexity: [PUBG, …])

updateSingleVector = function(inputCountry, inputLabel, lower=0, upper=1) {
  inputCountry
  inputLabel
  dfA <- df %>% filter(label == {{ inputLabel }}) %>%
    mutate(answer = (1-0) * (answer-lower) / (upper-lower) + 0) %>%  # the filter and mutate_at order is important and cannot be switched
    filter(country == {{ inputCountry }}) %>% 
    filter(!game %in% gameTooFew) %>%
    group_by(game) %>%
    dplyr::summarise(mean=mean(answer, na.rm=TRUE), sd=sd(answer, na.rm=TRUE)) 
  
  rnormtA <- c()
  for(row in 1:nrow(dfA)) {
    meanA <- as.double(dfA[row, "mean"])
    sdA <- as.double(dfA[row, "sd"])
    gameA = meanA
    
    if(sdA == 0) {
      gameA <- rep(meanA, 100)
    } else {
      gameA <- rnormt(100, c(0, 1), meanA, sdA)
    }
    rnormtA <- c(rnormtA, gameA)
  }
  return(rnormtA)
}
countries <- unique(df$country)
labels <- c("control_complexity", "learning_curve", "difficulty", "replayability", "pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space",
            "heroic", "real.world", "violent", "action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn",
             "stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
output <- data.frame(countryA=c(), countryB=c(), label.corr=c(), label=c())
for(label in labels) {
  v <- c()
  labelSpace <- data.frame(matrix(ncol = 0, nrow = 600))
  for(country in countries) {
    if(label %in% c("control_complexity", "learning_curve", "replayability")) {
      v <- updateSingleVector(country, label, lower=1, upper=3)
    } else if (label == "difficulty") {
      v <- updateSingleVector(country, label, lower=1, upper=4)
    } else {
      v <- updateSingleVector(country, label, lower=0, upper=1)
    }
    
    labelSpace <- cbind(labelSpace, as.data.frame(v))
    names(labelSpace)[names(labelSpace) == "v"] <- country
  }
  #break
  output <- rbind(corr.df(labelSpace, label), output)
}
# Now plotting the two correlations 
combined <- hofstedeCorr %>% left_join(output, by=c("countryA", "countryB")) 

ggscatter(combined, x = "hofstede.corr", y = "label.corr", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "spearman",
          xlab = "Hofstede", ylab = "Label")
## `geom_smooth()` using formula 'y ~ x'

5.4 Sampling 2: Game (11) * Label (28) vector correlation + sampling

Each vector is 11 games * 28 labels * 100

# Compare "label space" vector
updateVector <- function(inputCountry, inputrnormtA, inputLabel, lower=0, upper=1) {
  # update the vector of a country based on a specific label
  # Args: 
  #   country: string of countries
  #   label: string of the label of interest
  #   lower, upper: the lower and upper bound (0-1 for binary, 0-3 for control_complexity etc, 0-4 for difficulty)
  # Returns:
  #   an updated vector in the label space
  inputLabel
  inputCountry
  if(upper - lower == 0) {
    print("Something happened")
  }
  
  dfA <- df %>% filter(label == {{ inputLabel }}) %>%
    mutate(answer = (1-0) * (answer-lower) / (upper-lower) + 0) %>%  # the filter and mutate_at order is important and cannot be switched
    filter(country == {{ inputCountry }}) %>% 
    filter(!game %in% gameTooFew) %>%
    group_by(game) %>%
    dplyr::summarise(mean=mean(answer, na.rm=TRUE), sd=sd(answer, na.rm=TRUE)) 
  
  rnormtA <- inputrnormtA
  for(row in 1:nrow(dfA)) {
    meanA <- as.double(dfA[row, "mean"])
    sdA <- as.double(dfA[row, "sd"])
    
    if(sdA == 0) {
      gameA <- rep(meanA, 100)
    } else {
      gameA <- rnormt(100, c(0, 1), meanA, sdA) 
    }
    rnormtA <- c(rnormtA, gameA)
  }
  
  return(rnormtA)
}

labelSpace <- data.frame(matrix(ncol = 0, nrow = 28*600))
countries <- unique(df$country)
labels <- c("control_complexity", "learning_curve", "difficulty", "replayability", "pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space", "heroic", "real.world", "violent", 
             "action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn", 
             "stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
for(country in countries) {
  v <- c()
  for(label in labels) {
   
    if(label %in% c("control_complexity", "learning_curve", "replayability")) {
      v <- updateVector(country, v, label, lower=1, upper=3)
    } else if (label == "difficulty") {
      v <- updateVector(country, v, label, lower=1, upper=4)
    } else {
      v <- updateVector(country, v, label, lower=0, upper=1)
    }
  }
  labelSpace <- cbind(labelSpace, as.data.frame(v))
  names(labelSpace)[names(labelSpace) == "v"] <- country
}
library(corrplot)
mx <- cor(labelSpace, use = "pairwise.complete.obs")
corrplot(mx, type = "upper", order = "FPC", tl.cex=2,
         tl.col = "black", tl.srt = 45)

# Plot label correlation against the Hofstede scores
mx[upper.tri(mx, diag = TRUE)] <- NA
rownames(mx) <- colnames(mx) 
mx <- na.omit(reshape::melt(t(mx)))
mx <- mx[ order(mx$X1, mx$X2), ]
labelCorr <- mx %>% select(X1, X2, value)
labelCorr <- labelCorr %>% dplyr::rename(countryA=X1, countryB=X2, label.corr=value)
labelCorr$countryA <- as.character(labelCorr$countryA)
labelCorr$countryB <- as.character(labelCorr$countryB)
labelCorr <- transform(labelCorr, countryA = pmin(countryA, countryB), countryB=pmax(countryA, countryB))
head(labelCorr)
##        countryA countryB label.corr
## 17        India       US  0.3071151
## 33    Singapore       US  0.3100672
## 49 South.Africa       US  0.3755622
## 65      Nigeria       US  0.2735908
## 81    Argentina       US  0.3725302
## 97       Brazil       US  0.3708676
  • This finding shows that there is no clear trend in the correlation between Hofstede’s dimensions of culture and the label correlations. It seems that the label correlation is roughly the same across these countries. In other words, the Hofstede’s dimensions of culture, which was observed in organizational culture at IBM, doesn’t seem to explain the difference well.
# Now plotting the two correlations 
combined <- hofstedeCorr %>% left_join(labelCorr, by=c("countryA", "countryB")) 

ggplot(aes(x=hofstede.corr, y=label.corr), data=combined) +
  geom_point() +
  ylim(c(0,1)) +
  geom_smooth(method='lm')
## `geom_smooth()` using formula 'y ~ x'

cor(combined$hofstede.corr, combined$label.corr, method = "spearman")
## [1] 0.0307313

The following takes a similar approach, but looks at the individual game level.

6 Figure 4: How similar are label choices for bilingual speakers when asked questions in their native language versus English?

library(tidyverse)
library(tidytext)
library(data.table)
source("read.R")

theme_set(
  theme_bw() + 
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size=8))
)

df <- df %>% filter(!label %in% c("NA.positive.opinion", "NA.negative.opinion", "NA.feeling", "NA.art"))

We ran our surveys in both English and non-English languages on bilingual speakers within each non-English speaking country.

rnormt <- function(n, range, mu, s) {
  if(s == 0) {
    return(c(NaN))
  }
  # range is a vector of two values
  F.a <- pnorm(min(range), mean = mu, sd = s)
  F.b <- pnorm(max(range), mean = mu, sd = s)
  
  u <- runif(n, min = F.a, max = F.b)
  return(qnorm(u, mean = mu, sd = s))
}
# For some weird reason, I need to output the first four variable
compareLanguage <- function(languageA, languageB, inputCountries, label, lower=0, upper=1) {
  languageA 
  languageB
  inputCountries
  label
  dfA = df %>% dplyr::filter(label == {{ label }}) %>%  # select the label
    mutate(answer = (1-0) * (answer-lower) / (upper-lower) + 0) %>%  # rescale the Likert scale answer to 0-1 (e.g., 1-3, 1-4 to 0-1)
    dplyr::filter(Language == {{ languageA }} & country %in% inputCountries) %>% 
    group_by(game) %>%
    summarise(mean=mean(answer, na.rm=TRUE), sd=sd(answer, na.rm=TRUE)) %>%
    filter(!is.na(sd))  # in case there is only one response in a country
  
  # same operation for languageB
  dfB = df %>% dplyr::filter(label == {{ label }}) %>%  # select the label
    mutate(answer = (1-0) * (answer-lower) / (upper-lower) + 0) %>%  # rescale the Likert scale answer to 0-1 (e.g., 1-3, 1-4 to 0-1)
    dplyr::filter(Language == {{ languageB }} & country %in% inputCountries) %>% 
    group_by(game) %>%
    summarise(mean=mean(answer, na.rm=TRUE), sd=sd(answer, na.rm=TRUE)) %>%
    filter(!is.na(sd))  # in case there is only one response in a country

  rnormtA <- c()
  rnormtB <- c()

  for(row in unique(df$game)) {
    meanA <- as.double(dfA[dfA$game == row, "mean"])
    meanB <- as.double(dfB[dfB$game == row, "mean"])
    sdA <- as.double(dfA[dfA$game == row, "sd"])
    sdB <- as.double(dfB[dfB$game == row, "sd"])

    if(any(is.na(meanA)) | any(is.na(meanB))) next;
    if(any(is.na(sdA)) | any(is.na(sdB))) next;
    if(any(sdA == 0) | any(sdB == 0)) next;

    gameA <- rnormt(1000, c(0, 1), meanA, sdA)
    gameB <- rnormt(1000, c(0, 1), meanB, sdB)
    rnormtA <- c(rnormtA, gameA)
    rnormtB <- c(rnormtB, gameB)
  }
  
  # no game is available
  if(length(rnormtA) == 0) {
    return(data.frame(languageA = languageA, languageB = languageB, estimate=NaN, p=NaN, label=label))
  }

  cor.res = cor.test(rnormtA, rnormtB, method=c("pearson"), use = "complete.obs")
  return(data.frame(languageA = languageA, languageB = languageB, estimate=cor.res$estimate, p=cor.res$p.value, label=label))
}
matrix <- combn(unique(df$Language), 2)  # get the combination of languages
new.df <- data.frame(languageA=c(), languageB=c(), estimate=c(), p=c(), label=c())
for(col in 1:ncol(matrix)) {
  # for each combination (English vs. language X
  l1 <- matrix[1, col]  # English
  if(l1 != "English") break
  l2 <- matrix[2, col]  # language X
  language.countries <- df %>% filter(Language == l2)  # find the correpondings countries to language X
  
  for(label in c("control_complexity", "learning_curve", "replayability")) {
    comparison <- compareLanguage(l1, l2, c(unique(language.countries$country)), label, lower=1, upper=3)
    new.df <- rbind(new.df, comparison)
  }

  comparison <- compareLanguage(l1, l2, c(unique(language.countries$country)), "difficulty", lower=1, upper=4)
  new.df <- rbind(new.df, comparison)

  for(label in labels[!labels %in% c("control_complexity", "learning_curve", "difficulty", "replayability")]){
    comparison <- compareLanguage(l1, l2, c(unique(language.countries$country)), label, lower=0, upper=1)
    new.df <- rbind(new.df, comparison)  # update the new.df
  }
}
  • This result looks very cool as it indicates that the current translation of tags might be leaning toward popular markets. Japanese, Arabic, Greek markets are poorly translated.
new.df %>% filter(estimate < 0)
##        languageA languageB      estimate          p              label
## cor81    English    German -0.0015186309 0.87346612        achievement
## cor88    English     Greek -0.0028003688 0.87814680           pacifist
## cor90    English     Greek -0.0127911144 0.28460398               cozy
## cor94    English     Greek -0.0036384955 0.79701126         real.world
## cor97    English     Greek -0.0032613144 0.81766231          emotional
## cor101   English     Greek -0.0024696280 0.84832425           grinding
## cor103   English     Greek -0.0103806079 0.42143593         hand.drawn
## cor108   English     Greek -0.0052058916 0.64153084        achievement
## cor109   English     Greek -0.0151299980 0.28477862          immersion
## cor110   English     Greek -0.0084885782 0.64210821         creativity
## cor113   English  Japanese -0.0193274718 0.22166651      replayability
## cor115   English  Japanese -0.0720159174 0.02275921           pacifist
## cor116   English  Japanese -0.0234320142 0.29491504      made.for.kids
## cor117   English  Japanese -0.0082740802 0.71152920               cozy
## cor118   English  Japanese -0.0108711495 0.55170523                zen
## cor120   English  Japanese -0.0190434275 0.39466231             heroic
## cor124   English  Japanese -0.0063587958 0.84082892          emotional
## cor131   English  Japanese -0.0023162515 0.91754925  action.motivation
## cor132   English  Japanese -0.0008932378 0.96099559             social
## cor133   English  Japanese -0.0454120321 0.04228891            mastery
## cor134   English  Japanese -0.0261195416 0.24298013        achievement
## cor136   English  Japanese -0.0132795981 0.40110463         creativity
## cor154   English    Polish -0.0147580221 0.25304875           grinding
## cor155   English    Polish -0.0246806752 0.17654853         hand.drawn
## cor158   English    Polish -0.0093197112 0.47043775             social
## cor163   English    Arabic -0.0171182883 0.58872101 control_complexity
## cor164   English    Arabic -0.0154268056 0.62607602     learning_curve
## cor166   English    Arabic -0.0584431924 0.06468865         difficulty
## cor169   English    Arabic -0.0022168611 0.94418115            violent
## cor170   English    Arabic -0.0368267869 0.24462532          emotional